Emerging Themes in Epidemiology

Authors

  • Fraser I Lewis
  • Michael P Ward
Abstract

Regression modelling is one of the most widely used approaches in epidemiological analyses. It provides a method of identifying statistical associations, from which potential causal associations relevant to disease control may then be investigated. Multivariable regression, in which a single dependent variable (the outcome, usually disease) is modelled with multiple independent variables (predictors), has long been the standard model. Generalizing multivariable regression to multivariate regression, in which all variables are potentially statistically dependent, offers a far richer modelling framework. Through a series of simple illustrative examples we compare and contrast these approaches. The technical methodology used to implement multivariate regression, Bayesian network structure discovery, is well established and, while a relative newcomer to the epidemiological literature, has a long history in computing science. Applications of multivariate analysis in epidemiological studies can provide a greater understanding of disease processes at the population level, leading to the design of better disease control and prevention programs.

Introduction

Multivariable regression modelling, in which a single dependent variable is regressed on multiple independent variables, is a technique familiar to any epidemiologist. This analytical approach is a regular feature in the epidemiological literature, and is without doubt a useful tool. By extending this approach to an analogous multivariate regression model, in which all variables are simultaneously considered, substantially enhanced insight into the disease system under study may be gained. At worst, the multivariable and multivariate approaches will give identical results, as they must: to determine the best possible multivariate model of study data, all possible multivariable models must also be considered, since the latter are simply special cases of the former.
Gaining additional insights into a disease system by simply switching to a more general data analytic technique is clearly very attractive, in particular when the theoretical foundations for the more general approach are long established. The modelling methodology we consider here is referred to as Bayesian network analysis (as defined in [1,2]). This is a form of graphical modelling [3,4] whose focus is on structure discovery: determining an optimal statistical model, i.e. graphical structure, directly from observed data. Whilst relatively uncommon in the epidemiological literature, Bayesian network analyses are increasingly finding application in areas of biology, medicine and ecology (e.g. [5-12]), and Bayesian network modelling itself has a vast technical literature (as is easily seen by using the search term “Bayesian network” in any bibliographic database, e.g. PubMed or Web of Knowledge).

Identifying causal relationships is the objective of many epidemiological analyses involving regression modelling. Empirical analyses of epidemiological data can demonstrate statistical dependency between variables, and as we later demonstrate, Bayesian network analysis is ideally suited to such a task. While the identification of statistical dependency is often a natural step towards postulating causal mechanisms, it is vastly more ambitious to further assert that any given dependency exists within a particular causal web. Expert knowledge and biological understanding are clearly essential, since this is more than a statistical data analysis exercise.

Correspondence: Section of Epidemiology, VetSuisse Faculty, University of Zürich, Winterthurerstrasse 270, Zürich, CH 8057, Switzerland.
To avoid any unnecessary confusion, all analyses and discussion here pertain only to models of statistical association. It is a common misinterpretation to assume that arcs in a Bayesian network model denote causality; they denote only statistical dependency.

© 2013 Lewis and Ward; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Lewis and Ward, Emerging Themes in Epidemiology 2013, 10:4. http://www.ete-online.com/content/10/1/4

Our objective here is to demonstrate the potential utility of Bayesian network structure discovery to epidemiologists. We consider specifically additive Bayesian networks, which are Bayesian network models parameterized in an analogous fashion to generalised linear models. The classical formulation of Bayesian networks for binary or multinomial variables uses a mathematically elegant contingency table parameterisation [1,2]. For epidemiological analyses such a parameterisation is both unusual and rather opaque, and is likely vastly over-parameterized compared to the familiar additive formulation used in generalised linear models (as discussed in [13]). In the following sections we first briefly review the motivation and experimental origins of regression modelling in scientific studies. Graphical regression is then introduced, followed by a series of simple empirical examples which compare and contrast multivariable and multivariate regression. We then discuss the epidemiological implications of these results and the limitations of the approach.

Regression modelling concepts: a brief review

In classical experimental trial scenarios (e.g.
[14], such as factorial or Latin square designs), the investigator is able to fix at predetermined values all of the variables of interest in the experiment. These are the independent variables in a multivariable regression model. The research question being asked here is how the measurement variable (the outcome or response variable) changes across the different patterns of values chosen for the independent variables. This is the historical foundation of regression modelling. The ability to fix all variables of interest to predetermined values is crucial and underlies the experimental study design, because it enables unambiguous estimation of all key covariate effects on the response variable.

The classical experimental design scenario contrasts sharply with what is feasible and practical in many epidemiological studies, either in humans or other animals. Considering zoonotic pathogens for example, animal husbandry, livestock production and farm environment characteristics are by their nature highly inter-dependent. Thus, it is generally impossible to separate out the “true” effects of individual covariates on the response variable (e.g. the design matrix is not orthogonal, see [15]), because the estimated effect of any covariate will now generally depend on which other covariates are included in the model (including the case in which all variables are included). Moreover, determining the most appropriate covariates for inclusion in the model is considerably more difficult when dependencies exist between study variables, as in the case in which confounding variables are present. The Yule-Simpson paradox [16-18], in which an apparent relationship between variables (e.g. a disease and a putative risk factor) may disappear or even be reversed when other variables are taken into account, is particularly troublesome here, as are the closely related difficulties of negative (or positive) confounding.
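The reversal described by the Yule-Simpson paradox can be made concrete with a small numerical sketch. The counts below are hypothetical (they are not taken from the case study data or from [16-18]), and the names `counts`, `stratum_risks` and `pooled_risks` are illustrative only. Within each stratum the exposed group has the lower disease risk, yet once the strata are pooled the exposed group appears to have the higher risk, because exposure is unevenly distributed across strata.

```python
# Hypothetical counts, (diseased, total), by stratum and exposure status.
# The stratum (herd size) plays the role of the confounding variable.
counts = {
    "small_herd": {"exposed": (6, 87), "unexposed": (36, 270)},
    "large_herd": {"exposed": (71, 263), "unexposed": (25, 80)},
}


def stratum_risks(counts):
    """Disease risk for each exposure group within each stratum."""
    return {
        stratum: {group: d / n for group, (d, n) in groups.items()}
        for stratum, groups in counts.items()
    }


def pooled_risks(counts):
    """Disease risk for each exposure group after ignoring the stratum."""
    totals = {}
    for groups in counts.values():
        for group, (d, n) in groups.items():
            prev_d, prev_n = totals.get(group, (0, 0))
            totals[group] = (prev_d + d, prev_n + n)
    return {group: d / n for group, (d, n) in totals.items()}


by_stratum = stratum_risks(counts)
overall = pooled_risks(counts)

for stratum, risks in by_stratum.items():
    print(stratum, {g: round(r, 3) for g, r in risks.items()})
print("pooled", {g: round(r, 3) for g, r in overall.items()})
```

In this sketch a regression of disease on exposure alone and one that also includes herd size would give exposure effects of opposite sign, which is the dependence on "which other covariates are included" noted above.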
In multivariable regression, relationships between the “independent” variables in a study do not feature explicitly in the modelling process. This seems entirely reasonable in the classical designed experiment scenario. In regression analyses of epidemiologic data, where many interdependencies between study variables may be present, explicitly modelling all relationships between all variables is intuitively far more reasonable (as demonstrated in our later examples). Common multivariable model selection approaches, such as stepwise searches, may be sufficient to implicitly account for such inter-dependencies, and thus identify an optimal set of predictors for the outcome (disease) variable. But a considerable difficulty here is how to justify that the modelling results obtained are as optimal as is practicable for a given study.

The standard way to address such issues in statistical modelling is to compare a simpler model with a more general model. If the goodness of fit of the simpler model is no worse than that of the more general model, then the former is chosen as the preferred model. This is the concept of parsimony: it is more desirable to explain a phenomenon, e.g. disease occurrence, with a simpler rather than a more complex model. In our current context “more general” also refers to expanding the scope of the modelling framework to explicitly include all relationships between all variables, i.e. a multivariate rather than multivariable regression model. The Bayesian network literature has long provided all the necessary theory and algorithms (e.g. [1,2,19,20]) to implement such regression modelling. Historically, the main practical difficulty in the application of this approach has been a lack of suitable computing resources and relevant accessible software.

Regression modelling in epidemiology

In typical regression analyses found in the epidemiological literature (e.g. [21,22]) the use of a hypothesis testing (P-values) framework is still far more common than Bayesian inference.
There is a considerable body of evidence which strongly argues against the use of hypothesis testing and P-values for model comparison and selection. Information theoretic and Bayesian approaches are argued to be preferable on both conceptual and performance grounds [23-26]. When the primary objective is to identify optimal parsimonious models, i.e. structure discovery, then, in purely practical terms, the choice between a Bayesian and a non-Bayesian paradigm is largely irrelevant, because the use of uninformative or diffuse priors is standard practice in such analyses (e.g. see [1,2,19,20]). Hence, the actual parameter estimates in any given model will be almost identical to the maximum likelihood analogue. However, the very considerable advantage of adopting a Bayesian paradigm is that we can then directly utilize established model selection and comparison techniques from the Bayesian networks literature [2,19,20].

Empirical examples: multivariable versus multivariate

We first briefly describe a graphical statistical model (recall that additive Bayesian network structure discovery is concerned entirely with graphical models) and its conceptual differences from classical regression. We then present three separate illustrative analyses using risk factor case study data (unpublished veterinary data with variable names anonymized to maintain confidentiality) comprising 400 observations across 17 variables, where each variable is a measurement or attribute from an individual subject (animal) and each subject appears only once in the data. There are five binary variables and 12 continuous variables.
Note that for our current purposes background knowledge of the particular variables in the study is not relevant, as we are only interested in comparing and contrasting the statistical results obtained by applying two different techniques to identical data. This is an observational study and the investigator was not able to fix the values of any of these variables.

Introducing graphical regression

In graphical statistical modelling there is no distinction made between covariates and a response variable. All are just “variables” as, formally speaking, a graphical statistical model is a representation of the joint probability distribution of all the random variables in the data. Figure 1(a) depicts a graphical model which is directly analogous to a classical multivariable regression model, as arcs terminate only at a single “response” variable (e.g. g5). But this model has a statistical interpretation which is radically different from that in classical regression. Here: i) variables b3, b6, g9 and g10 are directly dependent with variable g5; ii) variables b3, b6, g9 and g10 are all indirectly dependent with each other (via g5); and iii) all other variables are independent. In terms of i), direct dependence means there is an arc directly connecting these variables (in either direction). In terms of ii), in a graphical model all variables in the same component (a collection of connected arcs, ignoring direction) are jointly statistically dependent. This means that knowing the value of one variable in this component can potentially provide information about the likely value of any other variable in this component (see [3,4]). If a variable has no arcs, either emanating from it or terminating at it, then it is statistically independent. In such a case knowing the value of any other variable in the model tells us nothing about the value of this variable.
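The three-part reading of Figure 1(a) given above can be sketched as a small graph computation. This is only an illustration under stated assumptions: the arcs into g5 are those described in the text, whereas `b1` and `g2` are hypothetical names standing in for study variables that have no arcs, and the function name `components` is our own. The sketch groups variables into components while ignoring arc direction.

```python
from collections import defaultdict

# Arcs as (parent, child) pairs; every arc terminates at the "response" g5.
arcs = [("b3", "g5"), ("b6", "g5"), ("g9", "g5"), ("g10", "g5")]

# b1 and g2 are hypothetical stand-ins for variables without any arcs.
variables = ["b1", "b3", "b6", "g2", "g5", "g9", "g10"]


def components(variables, arcs):
    """Return the connected components of the graph, ignoring arc direction."""
    neighbours = defaultdict(set)
    for parent, child in arcs:
        neighbours[parent].add(child)
        neighbours[child].add(parent)

    seen, result = set(), []
    for start in variables:
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:  # depth-first traversal from an unvisited variable
            node = stack.pop()
            if node in comp:
                continue
            comp.add(node)
            stack.extend(neighbours[node] - comp)
        seen |= comp
        result.append(comp)
    return result


comps = components(variables, arcs)

# i) directly dependent pairs: joined by an arc, in either direction
direct_pairs = {frozenset(arc) for arc in arcs}
# ii) jointly dependent variables: those sharing a component
joint = [sorted(c) for c in comps if len(c) > 1]
# iii) independent variables: components of size one
independent = sorted(v for c in comps if len(c) == 1 for v in c)

print("jointly dependent:", joint)
print("independent:", independent)
```

Because direction is ignored when forming components, reversing any arc in `arcs` leaves the output unchanged, which matches the point that these models encode statistical dependency only.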
All the graphical models we consider here are concerned only with statistical dependency, and arc direction in such models in no way implies any causal relationship. The direction of arcs is a result of the probability calculus required when dealing with models comprising joint probabilities. In general, arc direction has no epidemiological interpretation because observed data alone cannot discriminate between arcs of opposite directions. This is simply a consequence of factorising joint probability distributions, and is typically referred to as likelihood equivalence (see p. 1052 in [11] for a more general explanation, and [2] for technical details).

A potential practical complication of likelihood equivalence arises when searching for an optimal graphical structure. Standard search approaches in the literature, such as Heckerman’s heuristic hill climber [2] and the exact order based search by Koivisto [20] (the latter is used in our later case studies), identify a single optimal (directed) graph. This is as opposed to identifying all graphs within the same likelihood equivalence class, which is computationally intractable [2]. If the objective is to identify all statistical dependencies in study data then, as mentioned above, arc direction is not relevant and such difficulties can be ignored. This is not the case, however, when viewing the modelling results within a causal (or indeed a longitudinal) framework, as the arc direction then has an obvious real interpretation. In causal analyses the use of a priori restrictions on arc directions to avoid contradicting known epidemiological fact is likely appropriate (although not without some conceptual challenges, see p. 212 in [2]). Causal analyses of data using graphical models represent a large, and somewhat distinct, literature from Bayesian networks, with [27] a standard text.
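Likelihood equivalence can be checked numerically in the simplest case of two binary variables. The sketch below is an illustration under stated assumptions rather than the scoring used in the case studies: it uses maximum likelihood with a contingency table parameterisation (not the Bayesian score or the additive parameterisation discussed earlier), and the counts in `table` are hypothetical. It fits the graph A → B as P(A)P(B|A) and the graph B → A as P(B)P(A|B), and compares both with the model containing no arc.

```python
import math

# Hypothetical 2x2 table of counts: rows index A (0/1), columns index B (0/1).
table = [[30, 10],
         [20, 40]]


def loglik_row_to_col(counts):
    """Maximised log-likelihood of the graph row -> column: P(row) * P(col | row)."""
    total = sum(sum(row) for row in counts)
    ll = 0.0
    for row in counts:
        row_total = sum(row)
        if row_total == 0:
            continue
        ll += row_total * math.log(row_total / total)      # marginal term
        for cell in row:
            if cell > 0:
                ll += cell * math.log(cell / row_total)    # conditional term
    return ll


def loglik_no_arc(counts):
    """Maximised log-likelihood when the two variables are independent."""
    total = sum(sum(row) for row in counts)
    row_totals = [sum(row) for row in counts]
    col_totals = [sum(col) for col in zip(*counts)]
    return sum(t * math.log(t / total) for t in row_totals + col_totals if t > 0)


transposed = [list(col) for col in zip(*table)]

ll_a_to_b = loglik_row_to_col(table)       # graph A -> B
ll_b_to_a = loglik_row_to_col(transposed)  # graph B -> A
ll_none = loglik_no_arc(table)             # no arc between A and B

print(f"A -> B: {ll_a_to_b:.6f}")
print(f"B -> A: {ll_b_to_a:.6f}")
print(f"no arc: {ll_none:.6f}")
```

Both directed graphs reduce to the same quantity, the sum of n_ab log(n_ab / N) over the cells, so the data support the presence of an arc over its absence but cannot favour one direction over the other.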
In summary, classical multivariable regression can easily be denoted by a graphical model, but the interpretation of the model is different in that it is now a joint probability model, albeit of a very simple structure. The reason for considering such regression models within a graphical modelling framework is that the graphical structure can now easily be relaxed to allow dependencies (arcs) to be present between any variables, i.e. this framework allows us to directly compare results from applying multivariable regression and multivariate regression to the same data. This then gives us the main “result” of this paper: a demonstration of how using multivariate regression may enhance our understanding of a disease system.

Case study results

We now present three analyses. In each we determine the globally optimal “multivariable” graphical regression model, and compare this with the globally optimal “multivariate” graphical regression model. The term “globally optimal” here refers to a model which has the best possible

Similar resources

Emerging Themes in Epidemiology: Form and function

Emerging Themes in Epidemiology is a new, online Open Access journal. This editorial - which coincides with the Journal's launch - describes its unique review and publication model. The editorial board and review process of ETE will be managed by research degree students and will therefore be a training ground for students (though final editorial control rests with senior faculty Associate Edit...


Epidemiology in conflict – A call to arms

In this first special theme issue, Emerging Themes in Epidemiology publishes a collection of articles on the theme of Epidemiology in conflict. Violent conflict is an issue of great sensitivity within public health, but more structured research and reasoned discussion will allow us to better mitigate the public health impacts of war, and place the public health community in a more informed posi...


The birth of Emerging Themes in Epidemiology: a tale of Valerie, causality and epidemiology

Emerging Themes in Epidemiology (ETE) is a new, online, Open Access peer-reviewed journal. The Journal is unique in that it was conceived and is managed by research degree students in epidemiology and related public health fields. The Journal's management is overseen by its Editor-in-Chief and Associate Faculty Editors, all of whom are senior members of faculty. ETE aims to encourage debate and...


Seek, and ye shall find: Accessing the global epidemiological literature in different languages

The thematic series 'Beyond English: Accessing the global epidemiological literature' in Emerging Themes in Epidemiology highlights the wealth of epidemiological and public health literature in the major languages of the world, and the bibliographic databases through which they can be searched and accessed. This editorial suggests that all systematic reviews in epidemiology and public health sh...


Annual acknowledgement of manuscript reviewers 2014

The editor of Emerging Themes in Epidemiology would like to thank all our reviewers who have contributed to the journal in Volume 11 (2014).


Individual freedom versus collective responsibility: an ethicist's perspective

Philosophical theories of collective action have produced a number of alternative accounts of the rationality and morality of self-interest and altruism. These have obvious applications to communicable disease control, the avoidance of antibiotic resistance, the responsibility of healthcare professionals to patients with serious communicable diseases, and the sharing of personal data in epidemi...


Publication date: 2013